
    Transfer from Multiple MDPs

    Transfer reinforcement learning (RL) methods leverage the experience collected on a set of source tasks to speed up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a given target task. In this paper, we investigate the theoretical properties of this transfer method and introduce novel algorithms that adapt the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem.
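
    Sample transfer of this kind is easy to sketch. The snippet below is a minimal illustration, not the paper's algorithm: source samples are mixed into the target training set in proportion to a similarity weight per source task. The function name, the (s, a, r, s') tuple format, the similarity weights, and the budget are all assumptions made for the example.

```python
import numpy as np

def build_training_set(target_samples, source_samples, similarities, budget, rng=None):
    """Merge target samples with source samples drawn in proportion to each
    source task's similarity to the target (hypothetical weighting scheme)."""
    rng = rng or np.random.default_rng(0)
    dataset = list(target_samples)
    weights = np.asarray(similarities, dtype=float)
    weights = weights / weights.sum()
    # Split the transfer budget across source tasks according to similarity.
    for src, quota in zip(source_samples, rng.multinomial(budget, weights)):
        idx = rng.choice(len(src), size=min(quota, len(src)), replace=False)
        dataset.extend(src[i] for i in idx)
    return dataset

# Usage: transitions are (s, a, r, s') tuples collected from each task.
target = [(0.1, 0, 1.0, 0.2)]
sources = [[(0.1, 0, 0.9, 0.2), (0.3, 1, 0.0, 0.4)],
           [(0.5, 0, -1.0, 0.6)]]
train = build_training_set(target, sources, similarities=[0.8, 0.2], budget=2)
print(len(train), "training samples")
```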

    Experimental results: Reinforcement Learning of POMDPs using Spectral Methods

    We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm that runs through epochs: in each epoch, we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the epoch, an optimization oracle returns the optimal memoryless planning policy which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy and efficient scaling with respect to the dimensionality of the observation and action spaces. Comment: 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
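
    The spectral primitive behind this kind of estimator can be illustrated in isolation: under a fixed policy, the second-order moment matrix of consecutive observations factors through the hidden state, so its rank reveals the number of hidden states and its top singular vectors span the subspace used for parameter recovery. The toy transition and emission matrices below are assumptions; this is only the moment/SVD step, not the paper's full epoch-based algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 2 hidden states, 4 observations, dynamics under a fixed policy.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])                     # hidden-state transition matrix
O = np.array([[0.70, 0.20, 0.05, 0.05],
              [0.05, 0.05, 0.20, 0.70]])       # observation emission matrix

def rollout(n):
    x, obs = 0, []
    for _ in range(n):
        obs.append(rng.choice(4, p=O[x]))
        x = rng.choice(2, p=T[x])
    return obs

obs = rollout(50_000)

# Second-order moment of consecutive observations: M2[i, j] ~ P(o_t=i, o_{t+1}=j).
M2 = np.zeros((4, 4))
for o1, o2 in zip(obs[:-1], obs[1:]):
    M2[o1, o2] += 1.0
M2 /= M2.sum()

# Spectral step: M2 factors through the hidden state, so it has rank 2.
U, s, Vt = np.linalg.svd(M2)
print("singular values:", np.round(s, 4))      # two dominant values expected
```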

    Best-Arm Identification in Linear Bandits

    We study the best-arm identification problem in linear bandits, where the rewards of the arms depend linearly on an unknown parameter $\theta^*$ and the objective is to return the arm with the largest reward. We characterize the complexity of the problem and introduce sample allocation strategies that pull arms to identify the best arm with a fixed confidence, while minimizing the sample budget. In particular, we show the importance of exploiting the global linear structure to improve the estimates of the rewards of near-optimal arms. We analyze the proposed strategies and compare their empirical performance. Finally, as a by-product of our analysis, we point out the connection to the $G$-optimality criterion used in optimal experimental design. Comment: In Advances in Neural Information Processing Systems 27 (NIPS), 2014.
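
    A minimal sketch of the allocation idea, using a greedy G-optimal-style design rather than the paper's exact strategies: repeatedly pull the arm with the largest predictive variance under the current design matrix, then recommend the empirically best arm. The arm set, noise level, regularization, and horizon below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
arms = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [np.cos(0.3), np.sin(0.3)]])  # assumed arm features
theta_star = np.array([1.0, 0.1])              # unknown to the learner

A = 1e-3 * np.eye(2)                           # regularized design matrix
b = np.zeros(2)
for t in range(2000):
    # Greedy G-optimal step: pull the arm with largest predictive variance.
    A_inv = np.linalg.inv(A)
    i = int(np.argmax([x @ A_inv @ x for x in arms]))
    x = arms[i]
    r = x @ theta_star + 0.3 * rng.standard_normal()
    A += np.outer(x, x)
    b += r * x

theta_hat = np.linalg.solve(A, b)              # least-squares estimate
print("recommended arm:", int(np.argmax(arms @ theta_hat)))
```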

    Active Learning for Accurate Estimation of Linear Models

    We explore a sequential decision-making problem where the goal is to estimate a number of linear models uniformly well, given a shared budget of random contexts independently sampled from a known distribution. The decision maker must query one of the linear models for each incoming context, and receives an observation corrupted by noise whose level is unknown and depends on the model instance. We present Trace-UCB, an adaptive allocation algorithm that learns the noise levels while balancing contexts accordingly across the different linear functions, and derive guarantees for simple regret both in expectation and with high probability. Finally, we extend the algorithm and its guarantees to high-dimensional settings, where the number of linear models times the dimension of the contextual space exceeds the total budget of samples. Simulations with real data suggest that Trace-UCB is remarkably robust, outperforming a number of baselines even when its assumptions are violated. Comment: 37 pages, 8 figures. International Conference on Machine Learning (ICML), 2017.
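
    The allocation principle can be sketched as follows, with a simplified score rather than the exact Trace-UCB index: each incoming context is assigned to the model whose optimistic noise-variance estimate per collected sample is largest, so noisier instances receive a larger share of the budget. The warm start, bonus term, and all constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, budget = 3, 4, 600
betas = rng.standard_normal((K, d))            # assumed true linear models
sigmas = np.array([0.2, 1.0, 0.5])             # unknown noise levels

X = [[] for _ in range(K)]
y = [[] for _ in range(K)]
n = np.zeros(K, dtype=int)
var_hat = np.ones(K)

def observe(i, x):
    X[i].append(x)
    y[i].append(x @ betas[i] + sigmas[i] * rng.standard_normal())
    n[i] += 1

for i in range(K):                             # warm start: d samples per model
    for _ in range(d):
        observe(i, rng.standard_normal(d))

for t in range(budget - K * d):
    x = rng.standard_normal(d)                 # shared i.i.d. context stream
    # Simplified Trace-UCB-style score: optimistic variance per sample.
    score = (var_hat + np.sqrt(np.log(budget) / n)) / n
    i = int(np.argmax(score))
    observe(i, x)
    Xi, yi = np.array(X[i]), np.array(y[i])
    resid = yi - Xi @ np.linalg.lstsq(Xi, yi, rcond=None)[0]
    var_hat[i] = resid @ resid / max(n[i] - d, 1)

print("samples per model:", n)                 # noisier models get more budget
```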

    Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

    While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This leads to defining weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with $S^{\texttt{C}}$ communicating states, $A$ actions and $\Gamma^{\texttt{C}} \leq S^{\texttt{C}}$ possible communicating next states, we derive a $\widetilde{O}(D^{\texttt{C}} \sqrt{\Gamma^{\texttt{C}} S^{\texttt{C}} A T})$ regret bound, where $D^{\texttt{C}}$ is the diameter (i.e., the longest shortest path) of the communicating part of the MDP. This is in contrast with optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to bias the exploration and achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state of the art.
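
    The key mechanism can be caricatured in a few lines: a state-action pair is allowed to be optimistically connected to states never observed as its successors only while its visit count is small, and afterwards its confidence set is truncated to the empirically observed successors. The threshold below is a deliberately simplified stand-in for TUCRL's actual condition.

```python
import numpy as np

def plausible_next_states(counts_sa, observed_next, n_states, t, scale=1.0):
    """TUCRL-flavored truncation (simplified): allow optimistic transitions to
    unobserved states only while the visit count of (s, a) is below a threshold
    growing with t; afterwards restrict to successors actually observed."""
    threshold = scale * np.sqrt(t)             # simplified; the paper's rule differs
    if counts_sa < threshold:
        return set(range(n_states))            # optimism: anything is reachable
    return set(observed_next)                  # truncate to observed successors

# Usage: after 10k steps, a well-visited pair only keeps observed successors.
print(plausible_next_states(counts_sa=500, observed_next={0, 3},
                            n_states=6, t=10_000))
```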

    Second-Order Kernel Online Convex Optimization with Adaptive Sketching

    Kernel online convex optimization (KOCO) is a framework combining the expressiveness of non-parametric kernel models with the regret guarantees of online learning. First-order KOCO methods such as functional gradient descent require only $\mathcal{O}(t)$ time and space per iteration and, when the only information on the losses is their convexity, achieve a minimax optimal $\mathcal{O}(\sqrt{T})$ regret. Nonetheless, many common losses in kernel problems, such as the squared loss, logistic loss, and squared hinge loss, possess stronger curvature that can be exploited. In this case, second-order KOCO methods achieve $\mathcal{O}(\log(\operatorname{Det}(\boldsymbol{K})))$ regret, which we show scales as $\mathcal{O}(d_{\text{eff}}\log T)$, where $d_{\text{eff}}$ is the effective dimension of the problem and is usually much smaller than $\mathcal{O}(\sqrt{T})$. The main drawback of second-order methods is their much higher $\mathcal{O}(t^2)$ space and time complexity. In this paper, we introduce kernel online Newton step (KONS), a new second-order KOCO method that also achieves $\mathcal{O}(d_{\text{eff}}\log T)$ regret. To address the computational complexity of second-order methods, we introduce a new matrix sketching algorithm for the kernel matrix $\boldsymbol{K}_t$, and show that for a chosen parameter $\gamma \leq 1$ our Sketched-KONS reduces the space and time complexity by a factor of $\gamma^2$ to $\mathcal{O}(t^2\gamma^2)$ space and time per iteration, while incurring only $1/\gamma$ times more regret.
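
    A compact sketch of a sketched second-order update: a fixed Nyström dictionary stands in for the paper's adaptive leverage-score sketching, gradients of the squared loss are computed on the sketched feature map, and steps are preconditioned by an accumulated curvature matrix as in online Newton step. The kernel bandwidth, dictionary, target function, and the omitted step-size are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, Y, bw=0.5):
    """Gaussian kernel between a point x and a set of points Y."""
    return np.exp(-((x - Y) ** 2).sum(axis=-1) / (2 * bw ** 2))

dictionary = rng.uniform(-1, 1, size=(20, 1))  # fixed Nystrom sketch (assumed)
m = len(dictionary)
w = np.zeros(m)
A = np.eye(m)                                  # accumulated curvature matrix

for t in range(1000):
    x = rng.uniform(-1, 1, size=1)
    phi = k(x, dictionary)                     # sketched feature map
    target = np.sin(3 * x[0]) + 0.1 * rng.standard_normal()
    g = 2 * (phi @ w - target) * phi           # gradient of the squared loss
    A += np.outer(g, g)                        # ONS-style curvature update
    w -= np.linalg.solve(A, g)                 # preconditioned Newton-style step

probe = np.array([0.2])
print("probe squared error:", (k(probe, dictionary) @ w - np.sin(0.6)) ** 2)
```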

    Sequential Transfer in Multi-armed Bandit with Finite Set of Models

    Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi-armed bandit framework, where the objective is to minimize the cumulative regret over a sequence of tasks by incrementally transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for the estimation of the possible tasks and derive regret bounds for it.
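
    At test time, transfer from a finite set of estimated models can look like the following sketch: models whose predicted means are inconsistent with the empirical arm means are discarded, and play is optimistic over the survivors. (In the paper, the candidate set itself is estimated by the method of moments, which is not reproduced here.) The candidate models and confidence widths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
models = np.array([[0.9, 0.1, 0.5],            # candidate mean-reward vectors,
                   [0.2, 0.8, 0.4]])           # assumed estimated from past tasks
true_means = models[1]                         # the current task's model

n = np.zeros(3)
s = np.zeros(3)
active = [0, 1]
for t in range(1, 2001):
    mu_hat = np.where(n > 0, s / np.maximum(n, 1), 0.0)
    width = np.sqrt(2 * np.log(t + 1) / np.maximum(n, 1))
    # Discard models inconsistent with the empirical means of pulled arms.
    active = [m for m in active
              if np.all(np.abs(models[m] - mu_hat)[n > 0] <= width[n > 0])]
    active = active or list(range(len(models)))  # fail-safe against over-pruning
    # Optimistic play over the surviving models.
    i = int(np.argmax(models[active].max(axis=0)))
    n[i] += 1
    s[i] += float(rng.random() < true_means[i])

print("surviving models:", active, "pulls per arm:", n)
```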

    Learning with stochastic inputs and adversarial outputs

    Most of the research in online learning is focused either on the problem of adversarial classification (i.e., both inputs and labels are arbitrarily chosen by an adversary) or on the traditional supervised learning problem in which samples are independent and identically distributed according to a stationary probability distribution. Nonetheless, in a number of domains the relationship between inputs and outputs may be adversarial, whereas input instances are i.i.d. from a stationary distribution (e.g., user preferences). This scenario can be formalized as a learning problem with stochastic inputs and adversarial outputs. In this paper, we introduce this novel stochastic-adversarial learning setting and analyze its learnability. In particular, we show that in a binary classification problem over a horizon of $n$ rounds, given a hypothesis space $\mathscr{H}$ with finite VC-dimension, it is possible to design an algorithm that incrementally builds a suitable finite set of hypotheses from $\mathscr{H}$, used as input for an exponentially weighted forecaster, and achieves a cumulative regret of order $O(\sqrt{n\,VC(\mathscr{H})\log n})$ with overwhelming probability. This result shows that whenever inputs are i.i.d. it is possible to solve any binary classification problem using a finite VC-dimension hypothesis space with a sub-linear regret, independently of the way labels are generated (either stochastically or adversarially). We also discuss extensions to multi-class classification, regression, learning from experts, and bandit settings with stochastic side information, as well as applications to games.
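
    The construction is easy to instantiate for the simplest VC class, 1-D thresholds (VC dimension 1): a finite grid of thresholds stands in for the hypothesis set the algorithm builds incrementally from observed inputs, and an exponentially weighted forecaster is run over it against adversarially chosen labels. The grid size, learning rate, and label sequence below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
xs = rng.uniform(0, 1, size=n)                 # i.i.d. stochastic inputs

# Finite hypothesis set: 1-D thresholds; a fixed grid stands in for the
# finite subset of H the algorithm would build from observed inputs.
thr = np.linspace(0, 1, 51)
eta = np.sqrt(8 * np.log(len(thr)) / n)        # standard EWF learning rate
logw = np.zeros(len(thr))
alg_loss, h_loss = 0.0, np.zeros(len(thr))

for t in range(n):
    x = xs[t]
    preds = (x >= thr).astype(float)           # each hypothesis' prediction
    p = np.exp(logw - logw.max()); p /= p.sum()
    y_hat = float(rng.random() < p @ preds)    # randomized EWF prediction
    # Adversarial labels: mostly a threshold at 0.4, flipped every 7th round.
    y = float(x >= 0.4) if t % 7 else 1.0 - float(x >= 0.4)
    alg_loss += float(y_hat != y)
    inst = (preds != y).astype(float)          # 0/1 losses of all hypotheses
    h_loss += inst
    logw -= eta * inst                         # exponential weight update

print(f"EWF regret vs. best threshold: {alg_loss - h_loss.min():.1f}")
```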

    Truthful Learning Mechanisms for Multi-Slot Sponsored Search Auctions with Externalities

    Sponsored search auctions constitute one of the most successful applications of microeconomic mechanisms. In mechanism design, auctions are usually designed to incentivize advertisers to bid their truthful valuations and to assure both the advertisers and the auctioneer a non-negative utility. Nonetheless, in sponsored search auctions, the click-through rates (CTRs) of the advertisers are often unknown to the auctioneer, so standard truthful mechanisms cannot be directly applied and must be paired with an effective learning algorithm for the estimation of the CTRs. This introduces the critical problem of designing a learning mechanism that estimates the CTRs while implementing a truthful mechanism, with a revenue loss as small as possible compared to an optimal mechanism designed with the true CTRs. Previous work showed that, when dominant-strategy truthfulness is adopted, in single-slot auctions the problem can be solved using suitable exploration-exploitation mechanisms able to achieve a per-step regret (over the auctioneer's revenue) of order $O(T^{-1/3})$, where $T$ is the number of times the auction is repeated. It is also known that, when truthfulness in expectation is adopted, a per-step regret (over the social welfare) of order $O(T^{-1/2})$ can be obtained. In this paper we extend the results known in the literature to the case of multi-slot auctions. In this case, a model of the user is needed to characterize how the advertisers' valuations change over the slots. We adopt the cascade model, the most popular model in the sponsored search literature. We prove a number of novel upper and lower bounds both on the auctioneer's revenue loss and on the social welfare w.r.t. the VCG auction, and we report numerical simulations investigating the accuracy of the bounds in predicting the dependency of the regret on the auction parameters.
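
    To make the multi-slot setting concrete, here is a sketch of a VCG-style allocation and payment rule computed from estimated CTRs, using a separable position model as a stand-in for the full cascade model. The exploration phase that would produce q_hat, the position factors, and all numbers are assumptions; the paper's mechanisms additionally handle incentives during learning, which this snippet ignores.

```python
import numpy as np

def cascade_vcg(bids, q_hat, lambdas):
    """VCG-style allocation/payments for a position auction with separable,
    estimated CTRs (a simplification of the cascade model; illustrative only).
    lambdas: decreasing position factors, one per slot."""
    order = np.argsort(-q_hat * bids)          # allocate by estimated efficiency
    K = len(lambdas)
    lam = np.append(lambdas, 0.0)              # factor 0 beyond the last slot
    payments = {}
    for m, ad in enumerate(order[:K]):
        # Externality: value the displaced ads would gain if `ad` left.
        p = sum(q_hat[order[j]] * bids[order[j]] * (lam[j - 1] - lam[j])
                for j in range(m + 1, min(len(order), K + 1)))
        payments[int(ad)] = p
    return order[:K], payments

bids = np.array([2.0, 1.5, 1.0])
q_hat = np.array([0.10, 0.20, 0.05])           # CTR estimates from exploration
slots, pay = cascade_vcg(bids, q_hat, lambdas=np.array([1.0, 0.5]))
print("allocation:", slots, "payments:", pay)
```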